ABSTRACT
Data from a bivariate distribution is often graphically presented by using a
scatter plot. Adding a suitable depiction of the marginal data distributions to the edges
of the scatter plot allows interesting features of the marginal data distributions to be
seen alongside the original bivariate data. We propose providing these marginal
depictions by a modification of Quantile-Quantile plots (QQ plots) we call scrawl strips.

We also propose adding a strip or axis showing the location of the letter values, or the
broadened letter values, which we call a letter strip, or b-letter strip, respectively.
These two strips can be combined.

Since scrawl strips are modified QQ plots, they inherit useful features of QQ
plots. These features include information about the shape of the marginal distribution,
the presence and type of skewness, the presence of heavy tails, gaps and ties in the
ordered data, appearances of bimodality or high shoulders, and assumptions of normality.
This type of information is often either absent or depicted poorly, both in univari-

Introduction.

Data from a bivariate distribution is often graphically presented using a scatter plot. Normally,
one is also interested in the univariate marginal distributions of the data; but unfortunately, the scatter
plot often does not show interesting features of these marginals. In particular, the shapes of the
marginal distributions, ties in values in one variable, and gaps in the sequence of values of one variable may
be difficult or impossible to see in the scatter plot. Bimodality in either marginal may or may not be
easy to see. Many of these properties may be important; for example, the shapes of the marginal data
distributions may have crucial implications for the type of data analysis we wish to select, especially
when conventional assumptions of Gaussianity can be seen to be unjustified.

Ways to look at these marginal data distributions include plotting separate histograms (Tufte
(1983)), rootograms, hanging or suspended rootograms (explained in Tukey (1965)), boxplots, displays
of letter values (median, hinges, eighths, etc.), or Quantile-Quantile plots (QQ plots). The type of
QQ plot we are interested in is a scatter plot of univariate ordered data (the order statistics or quantiles)
against the expected (mean) or anticipated (median) values of these order statistics for a chosen
distributional assumption (for example, Normal scores).

It would be desirable to present this univariate information alongside the scatter plot, so one
could directly compare features of both the bivariate data distribution and the marginal data
distributions. Other authors (see Chambers et al. (1983)) have used the "alongside" idea by including boxplots
or jittered dot strips (discussed later) of the X data in a strip above the scatter plot, using the X-axis as
a scale. Similarly, they plot the Y data alongside the vertical axis of the scatter plot. Tufte (1983) has
advocated similar ideas.

Plausible  reasons  for  depicting  marginals. 

Since presenting the marginal data distributions in strips alongside a scatter plot appears useful, it
may be desirable to base the presentation of these marginals on a different method than, say, boxplots
or histograms. Before choosing a method we should discuss what features are desirable in depicting a
univariate distribution of data. Some desirable features in depicting univariate data include:

(1) easy detection of outliers;

(2)  easy  semi-quantitative  assessment  of  center  and  spread  of  the  distribution; 

(3) depiction of the general shape of the distribution (this includes presence and type of skewness or
lack of symmetry, heavy or squashed tails, and possibly comparing the distribution with a
reference distribution such as the Gaussian);

(4) detection of unusual gaps in the sequence of ordered data values;

(5)  detection  of  ties  in  data  values; 

(6) noticeability of apparent bimodality, or of high shoulders compared to the reference distribution
(which is appropriately called relative bimodality).

One wishes to choose a method which stresses those features of the marginal distributions that the
scatter plot will depict weakly or not at all. Outliers will often be clearer on a scatter plot than in the
marginal distributions (for a discussion, see Tufte (1983), p. 14), so item (1) seems to be the least
important criterion in choosing a method of presentation. While one can try to visually estimate center and
spread of the marginals from the scatter plot, one will usually get rather poor estimates, so item (2) is
of moderate importance. By contrast, item (3) is almost never apparent on any scatter plot and is often
of great interest, so it will be very important in the choice of method of presentation. Of the other items,
(4) and (5) will be difficult to see on the scatter plot, and item (6) may or may not be apparent. It
seems to be generally true that techniques effective in depicting the broader aspects, such as items (2)
or (3), do not serve so well in depicting narrower aspects, like items (4), (5), and (6), so we shall attack
broader and narrower aspects separately.

Dealing  with  the  broader  aspects. 

Histograms easily suggest center and spread of the distribution, albeit sometimes crudely. They
seem often to give a reasonable indication of symmetry. The presence of heavier-than-Gaussian tails
may be crucial to the type of analysis chosen (e.g. tests based on the usual correlation coefficient are
likely to be non-conservative if the marginals are heavy-tailed), but histograms will not easily detect the
presence of such heavy-tailedness. Histograms may give no information about gaps and ties,
particularly if the interval sizes are not chosen cleverly, or if the gaps are small. In fact, presence of ties or
gaps in the distribution may give rise to very different histograms, depending on the particular choice of
interval widths and placement. An unlucky choice of interval widths or placement may also disguise
the presence of bimodality (with regard to a rectangular distribution), or of high shoulders (with regard
to the reference distribution). The heights of the bars are not strictly comparable, since they have
different variances. Finally, it is not clear how to modify a histogram so as to fit in a small strip
without losing its desirable characteristics. Rootograms retain many of these defects, though the heights
of the rootogram bars are variance stabilized.

A modification of the suspended rootogram (see Tukey (1965)), where one displays only the
variation of the bars about the baseline, can be displayed in a narrow strip. This plot will do moderately
well in presenting information about shape (item (3)), since the bars are now compared with an
expected or fitted reference distribution, and the variance of the bar lengths is stabilized. Because only
part of the suspended rootogram is displayed, such a plot requires some sophistication to understand and
interpret effectively.

One can also present the marginal values along the edges of a scatter plot by marking or jittering
these values (an explanation of jittering is in Chambers et al. (1983), Chapter 2), or one may present
information about these values by presenting boxplots. Both these alternatives will give some idea
about symmetry, but will do worse than histograms in depicting other parts of item (3). Boxplots give
a better feeling for estimating center and spread than merely marking or jittering the values. Marking
or jittering will do better in detecting bimodality.

Much of the sample distribution shape can be summarized by the sequence of sample letter
values (cf. Tukey (1977), Chapter 2 of Hoaglin et al. (1983), and Chapter 10 of Hoaglin et al. (1985)).
In our case, "M" denotes the sample median, the upper and lower hinges are denoted by "H", the upper
and lower eighths are denoted by "E", the upper and lower sixteenths by "D", thirty-seconds by "C".
We continue on in this fashion, where successive powers of two correspond to traversing the alphabet
backwards, i.e. "B", "A", "Z", etc., until we hit the extremes of the sample. (Rules for dealing with
fractions of an observation are set out in Tukey (1977), Hoaglin et al. (1983), or Hoaglin et al. (1985).)
A simple presentation of the position of the letter values, including the median and hinges (shown in
every boxplot), but also including the other letter values, tells us at least as much as the boxplot. Such
information can be placed in a narrow strip or on an axis at the edge of the scatter plot, naturally called
a letter strip. We use letter labels here (as in Tukey (1977), Hoaglin et al. (1983), and Hoaglin et al.
(1985)), avoiding numerical labels naturally taken as fractions, for two reasons:

(a) numerical fractions look like P-values, so that too many viewers would think about significance
rather than distribution shape, and

(b) no sequence decreasing by a constant ratio continues to have simple numbers over a wide range -
thus using single letters will reduce clutter.

We have included a partial translation of the letters in Figure 1 for the convenience of readers
unfamiliar with the use of letters. The letter values are estimates of the corresponding quantiles in the
true (unknown) distribution, and can be made more stable - and, for many purposes, more useful - by
using averages of the order statistics around a given letter value. We shall call such estimates
broadened letter values and will display the broadened letter values by lower case letters, e.g. the
broadened hinges will be denoted by "h". For a precise technical description of broadened letter values,
see Mendoza (1984), and the appendix. We shall naturally call the broadened letter values b-letters, and a
strip containing b-letters a b-letter strip. (We have deferred to a referee here, but plan to use the word
bletter elsewhere.)
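
The depth rule behind the ordinary (unbroadened) letter values can be sketched in a few lines of
Python. This is a minimal illustration, assuming the usual rule set out in Tukey (1977): the median has
depth (n+1)/2, each later depth is (floor of the previous depth + 1)/2, and a half-integer depth
averages the two adjacent order statistics. The function name letter_values and the label string are
illustrative choices, not part of any published routine, and broadened letter values are not computed.

```python
import math

# Letter labels in the order used in the text: M, H, E, D, C, B, A, Z, ...
LABELS = "MHEDCBAZYXWVUTSRQPON"


def letter_values(data):
    """Return (label, depth, lower value, upper value) down to the extremes."""
    xs = sorted(data)
    n = len(xs)

    def at_depth(depth, from_top):
        # A half-integer depth averages the two neighbouring order statistics.
        lo_i, hi_i = math.floor(depth), math.ceil(depth)
        if from_top:
            return (xs[n - lo_i] + xs[n - hi_i]) / 2
        return (xs[lo_i - 1] + xs[hi_i - 1]) / 2

    out = []
    depth = (n + 1) / 2  # depth of the median
    for label in LABELS:
        out.append((label, depth, at_depth(depth, False), at_depth(depth, True)))
        if depth <= 1:  # reached the extremes of the sample
            break
        depth = (math.floor(depth) + 1) / 2
    return out


if __name__ == "__main__":
    for row in letter_values(range(1, 10)):
        print(row)
```

Plotting the lower and upper values of each row as labelled ticks along the data axis gives the kind
of letter strip described above.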

An example of a b-letter strip (note we have used lower case letters for b-letter values) is shown
just beneath the top of the frame of Figure 1, where a QQ plot of 74 models of automobiles selected in
model year 1979¹ has been made. For purposes of comparison, immediately below the b-letter strip is a
letter strip, where the letters are ticked in using upper case. One can clearly see where the local gap at
around 3000 pounds has shifted the median to the right, compared to the more stable broadened
median. We easily see the left-hand e and h are farther from the broadened median m than their
right-hand analogs, so there is some indication of left skewness in the shoulders of the distribution. When
we look at the d's and c's (the left-hand c is coincident with the left-hand b), however, we see a clear
indication of skewness to the right.

Figure 2 is a histogram of the same data, and shows that a histogram can often be considerably
less helpful than a letter or b-letter strip, especially since it is insensitive to high shoulders alone. The
circles in Figure 1 plot the b-letter values versus their anticipated means. We chose open circles so as
to interfere only minimally with the data points plotted on the QQ plot. Because we feel a b-letter strip
may be a useful addition to any QQ plot, we have replotted Figure 1 in Figure 6, where the distraction
of the added letter strip has been removed.

Dealing  with  the  other  aspects. 

QQ plots make steps towards remedying many of the remaining problems, but require more
sophistication to interpret. Here, we shall consider only QQ plots of the ordered data against their
Normal scores. However, knowledge about the data, or about assumptions one wishes to check, may
sometimes make it appropriate to use order-statistic scores for some non-Gaussian distribution. Much of the
information about distribution in a QQ plot lies in the curvilinear appearances shown by some plots.
For example, antisymmetric appearing plots (using a symmetric reference) correspond to symmetric data
distributions; if the reference is Gaussian, plots close to straight lines correspond to data which appears
close to Gaussian, and by watching changes in the apparent slope of the plot, one can see whether the tails
of the distribution of the data are long or short compared to the Gaussian, or whether there seem
to be two or more areas of relative concentration. Since each data point appears on the plot, ties and
gaps show up well. To gain such advantages, we use a modification of the QQ plot in our presentation
of the marginals.

¹ This data came from the database distributed with the Bell Labs statistical package S and is tabulated in Chambers et
al. (1983).




As an example, see Figure 4. Several things can be seen immediately from this plot. The data
appears short-tailed with respect to a Gaussian reference on the low end, but the tail on the high end
appears approximately Gaussian. Several large gaps are obvious. The biggest gaps are at 3000 and
3500 pounds, with lesser gaps noticeable at 2500 and 4000 pounds, and less obviously noticeable at
3800 pounds. Even if these gaps were removed by rigidly moving sections of the plot together
horizontally, high shoulders with disproportionate concentrations near 2000 and 3400 pounds would still remain
clear (from the two intervals of steeper slope) and bimodality would tend to be suspected. There is a
tie between the weights of the second and third lightest models of car.

Scrawl  strips. 

As mentioned earlier, if one uses QQ plots to present the marginal distributions of bivariate data
alongside the scatter plot, the natural idea is to put each QQ plot in a strip adjacent to the X and Y
axes respectively, using the same scales for the data on both the strips and the scatter plot. The strip on
the top (or bottom) thus plots X data along the X-axis scale; the strip on the left (or right) side thus
plots Y data against the Y-axis scale.

This creates a problem. If one increases the size of the scatter plot relative to the QQ plots, on a
piece of paper or display device of fixed size, one runs the risk of losing the visual information in the
QQ plots due to a much reduced scale on the Q-axes corresponding to the expected order statistics. So
the problem is: how does one reduce scale on the Q-axes while keeping the visual information in the
QQ plot obvious? Scrawl strips are one such method of presenting this information about the
marginals. The idea is to preserve the visual information, but drastically reduce the physical space taken by
the plot. The technique we focus on here is to first "fold" the original QQ plot, and then to present this
new plot together with the scatter plot.²

Here is a detailed description of how one might make such a plot by hand:

(1) Prepare the QQ plot by plotting the X data along the X-axis, and the expected order statistics of
the chosen reference distribution on the other axis. As an example, see Figure 1.

(2) Choose a positive integer N. Divide the plot into N horizontal strips, each equally deep.
Number these strips in ascending order, 1 to N, from lowest to highest on the plot. "Fold" the
top strip, N, onto strip (N-1) by reflecting all points in strip N in the dividing line between the
two strips (see Figure 3).

(3) Repeat the process for the new top strip (N-1) and strip (N-2). Continue until only the bottom
strip is left. The entire QQ plot is still present, but folded onto strip 1, with a consequent
reduction in height. The final plot is called a scrawl strip. (We could call it a Vespal plot, as the
common hornet, Vespa, folds its wings at rest in a similar way to the folding of the QQ plot.) The
reader may wish to compare the bottom strip (scrawl strip) of Figure 4, produced from the QQ
plot in Figure 1, with Figure 1.

(4)  Treat  the  Y  data  similarly,  but  interchange  the  axes  and  fold  along  vertical  strips. 
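
Folding all N strips in turn is equivalent to passing the score coordinate through a triangle wave
whose period is twice the strip depth: odd-numbered strips land on strip 1 by pure translation,
even-numbered strips by reflection. The Python sketch below illustrates steps (1) - (3) under that
observation. It is not the S code mentioned in the footnote; the function names are hypothetical, and
Blom's approximation to the expected Normal order statistics is an assumption made only to keep the
example self-contained.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate expected Normal order statistics (Blom's formula; an assumption)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def fold(scores, n_strips):
    """Fold score values onto the lowest of n_strips equally deep strips.

    Returns the folded heights (measured from the bottom of strip 1)
    and the depth of one strip.
    """
    lo, hi = min(scores), max(scores)
    depth = (hi - lo) / n_strips
    if depth == 0:  # degenerate case: all scores equal
        return [0.0 for _ in scores], 0.0
    folded = []
    for z in scores:
        t = (z - lo) % (2 * depth)  # position within one up-and-down cycle
        folded.append(t if t <= depth else 2 * depth - t)
    return folded, depth


def scrawl_strip(data, n_strips):
    """Return (x, folded score) points for the X scrawl strip, and the strip depth."""
    xs = sorted(data)
    folded, depth = fold(normal_scores(len(xs)), n_strips)
    return list(zip(xs, folded)), depth


if __name__ == "__main__":
    points, depth = scrawl_strip([2.0, 2.0, 3.1, 3.3, 4.2, 5.5, 6.3], 4)
    print(depth, points)
```

For the Y data (step (4)) the same folded values would be plotted horizontally against the Y-axis
scale.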

Steps (1) - (3) produce a strip with a trace in regular jigs (upwards) and jags (downward), as can
be seen by inspecting the scrawl plots in Figure 4. Slopes will be preserved, except for a sign change
in jags. This means that in low absolute slope portions of the trace, there will be shallow appearing
"peaks", and, in large absolute slope portions, these "peaks" will appear steeper and denser. Hence
information about whether the tails are long or short (compared to the reference distribution) is easily
seen on the scrawl plot as a changing density of "peaks", or in more detail as a changing absolute value
of the local slope.

An alternative to the folding described in steps (2) and (3) is, at each stage, to directly translate
the part of the plot in the remaining uppermost strip down onto the strip below it. This possibility, a
modular or sliced strip, described and illustrated in Veitch (1984), we find less satisfactory.
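
The difference between the two constructions can be seen on a smooth ramp of heights. The short
Python sketch below (hypothetical helper names, made-up ramp values) reflects heights for the folded
version and reduces them modulo the strip depth for the sliced version. It only illustrates one visible
difference, namely that the sliced trace jumps from the top of the strip back to the bottom at every
strip boundary while the folded trace stays connected; the source does not say whether this is the
reason for the stated preference.

```python
def fold_value(t, depth):
    """Reflect height t (measured from the bottom of the plot) into [0, depth]."""
    r = t % (2 * depth)
    return r if r <= depth else 2 * depth - r


def slice_value(t, depth):
    """Translate height t straight down into [0, depth) (modular or sliced strip)."""
    return t % depth


def largest_jump(values):
    """Largest change between consecutive plotted heights."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# A smooth ramp crossing three strips of depth 1 (illustrative values only).
heights = [i * 0.1 for i in range(31)]
folded = [fold_value(t, 1.0) for t in heights]
sliced = [slice_value(t, 1.0) for t in heights]

if __name__ == "__main__":
    print("largest jump, folded:", largest_jump(folded))
    print("largest jump, sliced:", largest_jump(sliced))
```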

² Source code for the scrawl strip routines (written in the interface language of the statistical package S) is available
from the authors on request. (Write to James G. Veitch at Franz, Inc., Suite 270, 1141 Harbor Bay Parkway, Alameda,
CA 94501.)

Figure 4 is a scatter plot of auto weight versus price for the same 74 makes of automobile given
in Figure 1. It is bordered by two scrawl plots, one for each marginal. The X scrawl plot is presented
below the scatter plot, using the same coordinates. The Y scrawl plot is similarly presented alongside
the scatter plot. Note that we might instead put the X scrawl plot above the scatter plot and/or the Y
scrawl plot on the right side of the scatter plot. We have "ticked" the values of the broadened median,
hinges, eighths, etc., of the marginals in b-letter strips between the scrawl plots and the scatter
plot.

The information present in the original QQ plot (Figure 1) is pretty much preserved in the X
scrawl strip. Gaps and ties are more obvious; slope changes may be a little less obvious, but one can
still see that the distribution of auto weights is somewhat short-tailed compared to a Gaussian reference.
The bimodality of the weight distribution, with concentrations near 2000 and 3400 pounds, is at least as
clear as in the QQ plot. The price distribution is very long-tailed on the high side; this is apparent
even in the scatter plot itself. A tie shows up at the lowest price. Many methods of analysis assume
symmetry in the distribution. If one wishes to symmetrize the distribution of prices, given the observed
skewness, a commonly suggested procedure is to take logs. The outcome of this procedure is displayed
in Figure 5, where one can see from the scrawl strip of log prices that the distribution is still somewhat
long-tailed on the high side. At this point, if one wished to symmetrize or Gaussianize the distribution
of prices, it would probably be worthwhile to work only with the univariate distribution of prices, e.g.
one might look at QQ plots or suspended rootograms for different transformations of the price data.

Figure 7 shows scatter plots and scrawl strips of sepal length and sepal width for the famous Iris
data on which Fisher (1936) illustrated the use of the discriminant function. Only two of the three
species are in this plot, virginica and versicolor (the original data also includes petal length and petal
width). We exclude the third species (setosa), as it is clearly differentiated from the other two species
on inspection of a scatter plot of all three (see Chambers et al. (1983), page 108, for plots of all three
species showing petal width against petal length - setosa is easily distinguished, as would be the case
here). The large number of ties, due to rounding of the basic measurements for both sepal width and
sepal length, shows clearly in the scrawl strips. If we wish to avoid distraction of the eye by these ties
we may "jitter" each coordinate by adding a small random perturbation to each value (in the example
these additions simulate a uniform distribution between 0 and 0.1, so that the resulting intervals abut).
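
The jittering described here amounts to one line of arithmetic per value. The following Python
sketch (hypothetical function name; the seed argument is added only to make the illustration
repeatable) adds an independent uniform perturbation on [0, 0.1] to each value. It assumes the values
were recorded to the nearest 0.1, so that the jittered intervals for neighbouring recorded values abut.

```python
import random


def jitter(values, width=0.1, seed=None):
    """Add an independent Uniform(0, width) perturbation to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(0.0, width) for v in values]


if __name__ == "__main__":
    # Made-up tied measurements, not taken from the Iris data.
    tied = [5.5, 5.5, 5.5, 5.6, 5.6]
    print(jitter(tied, seed=1))
```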

Figure 8 shows the scatter plots and scrawl strips for the jittered coordinates. All trace of
ties has disappeared and we can see more clearly that sepal length is somewhat skewed to the right.
There is also an apparent area of concentration relative to the Gaussian near the lower eighth (5.5 cm)
for sepal length, and a gap at about 7.5 cm.

Figure 9 examines the two hard-to-separate Iris species, using information from all 4 of the
flower dimensions. Each dimension was first put on a relative scale by dividing by the median (for
virginica and versicolor combined) for that coordinate and the results were then assembled. The
horizontal scale refers to the natural "size" measure, namely the sum of all four relative dimensions (hence
centered near 4), while the vertical scale refers to the natural "shape" measure, the sum of the two relative
lengths minus the sum of the two relative widths (hence centered near zero). The horizontal scrawl
strip shows smaller peak-to-peak spacings at the ends than in the middle, as would be expected for
relative bimodality. The vertical scrawl strip shows a gradation from wide spacing on the - side to narrow
spacing on the + side, as would correspond to skewness.
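
The "size" and "shape" measures can be written out directly. The Python sketch below is a minimal
illustration with a hypothetical function name; it assumes each row holds (sepal length, sepal width,
petal length, petal width) in that order, and the three rows in the usage example are made-up numbers,
not Iris measurements.

```python
from statistics import median


def size_and_shape(rows):
    """Return (size, shape) per row from four flower dimensions.

    Each column is first divided by its own median over all rows.
    size  = sum of the four relative dimensions
    shape = (relative lengths) - (relative widths)
    """
    medians = [median(col) for col in zip(*rows)]
    result = []
    for row in rows:
        sl, sw, pl, pw = (v / m for v, m in zip(row, medians))
        result.append((sl + sw + pl + pw, (sl + pl) - (sw + pw)))
    return result


if __name__ == "__main__":
    rows = [(6.0, 3.0, 4.0, 1.0), (7.0, 3.0, 5.0, 2.0), (8.0, 3.0, 6.0, 3.0)]
    for size, shape in size_and_shape(rows):
        print(round(size, 3), round(shape, 3))
```

A flower whose four dimensions all equal the column medians has size 4 and shape 0, which is why
the two scales are centered near those values.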

Figure 10 shows the same scatter diagram, with the two species identified. Their overlap in
"size" is small - presumably small enough to account for the appearance of relative bimodality in
Figure 9's scrawl strip for size. We can also see that the spread of versicolor's shape is much greater than
that of virginica, and extends much further to negative values than to positive. Even if the distributions
of the individual species were Gaussian, this imbalance could account for the sort of skewness apparent
in Figure 9's scrawl strip for the shape variate.

This example shows that scrawl strips can be effective in revealing even moderate amounts, either
of relative bimodality or of skewness.

Figure 11, using petal length and petal width recentered separately for each of the 3 Iris species,
offers a synthetic example of the detection of heavy tails due to heterogeneous variability: both
scrawl plots are consistently steeper in the middle than toward the tails. The reason for the
heavy-tailedness is shown in Figure 12, where the recentered results for the 3 species are distinguished.
Clearly versicolor shows greater variability than the other two (as it should according to Anderson's
explanation of its origin by introgressive hybridization of the other two species).

Parameters  controlling  the  appearance  of  scrawl  plots. 

In addition to choosing the width of the scrawl plots, one must make two further choices in this
procedure. One must choose the reference distribution with which one desires a comparison, possibly
because of a wish to visually check on certain assumptions, and one must choose the number of folds,
N. The Gaussian distribution will probably be the usual choice, and is the one we have implemented in
the plots presented here. For a given set of data, and a given reference distribution, the number of
folds and the overall size of the strip control the visual impact of the scrawl strip. One would like to
have scrawl strips of data which come from similar distributions, but with a different sample size, look
similar. The number of folds, N, we make does not seem to influence the visual impact very much. A
stronger factor often seems to be the visually perceived slopes of the trace in the scrawl strip. To
control both N and slope, we would need to make the width of the scrawl strips data dependent; in our
basic programs we chose to fix the widths and control the slope.

This choice about visual appearance of the scrawl strip implies that any algorithm to produce a
plot must know the physical dimensions of the final scrawl strip. For concreteness, suppose that we
wish to produce the X variable scrawl strip, and that this plot will have dimensions $w$ inches in width
and $h$ inches in height. Denote the ordered X data by $x_{(1)}, \ldots, x_{(n)}$, and convert these to physical
distances from the left edge of the strip (in inches). Denote these distances by $D_{x(1)}, \ldots, D_{x(n)}$. Denote
the reference scores by $z_{(1)}, \ldots, z_{(n)}$. Denote the index of the lower 25th percentile (the lower
quartile) by $L$ and the index of the upper 25th percentile (the upper quartile) by $U$, so these are given by
$x_{(L)}$ and $x_{(U)}$ for the data, $z_{(L)}$ and $z_{(U)}$ for the reference scores respectively. Suppose that the modulus
of the visual "slope" of the segments is to be set to approximately $s$. Our algorithm proceeds in several
steps:

(1) Prepare a standard QQ plot with width $w$, height $h'$ to be determined. We set the distance (in
inches) of $z_{(U)}$ to $z_{(L)}$ to be $d = s\,(D_{x(U)} - D_{x(L)})$, so the visual slope from the lower quartile to the
upper quartile is $s$. Then the total height $h'$ needed is given by

$$h' = d\,\frac{z_{(n)} - z_{(1)}}{z_{(U)} - z_{(L)}}$$

Note that $d$ depends only on the width of the scrawl strip $w$ and the choice of where to plot the
X data, and $h'$ only depends on the shape of the QQ plot, not on its vertical scaling.

(2) Determine the number of strips N needed to fold the plot of step (1) into the physical space
allotted, $h$, by dividing the length, $h'$, found in step (1) by $h$ and rounding up to the nearest integer.

(3) If $h'$ is smaller than $h$, the QQ plot is merely expanded to height $h$ with no folding. This might
happen if the slope constant $s$ were small (usually a poor choice), or if the data distribution were
extremely long-tailed (when a different scatter plot is likely to be an improvement).
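
The three steps can be traced numerically. The Python sketch below is an illustration under stated
assumptions, not the authors' S routine: the function name is hypothetical, the data range is mapped
linearly onto the full strip width, Blom's approximation stands in for the reference scores, and the
quartile indices are taken by simple rounding. It returns the unfolded height and the number of
strips N.

```python
import math
from statistics import NormalDist


def strip_layout(data, width_in, height_in, slope=1.0):
    """Return (unfolded height in inches, number of strips N) for an X scrawl strip."""
    xs = sorted(data)
    n = len(xs)
    nd = NormalDist()
    # Reference scores: Blom's approximation to Normal order statistics (assumption).
    zs = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Physical distances from the left edge: data range spans the strip (assumption).
    span = xs[-1] - xs[0]
    dx = [width_in * (x - xs[0]) / span for x in xs]
    # Zero-based indices of the lower and upper quartiles (simple rounding; assumption).
    lo, hi = round(0.25 * (n - 1)), round(0.75 * (n - 1))
    # Step (1): quartile-to-quartile distance d, then total unfolded height.
    d = slope * (dx[hi] - dx[lo])
    unfolded = d * (zs[-1] - zs[0]) / (zs[hi] - zs[lo])
    # Steps (2) and (3): round up; a plot shorter than the strip is not folded.
    n_strips = max(1, math.ceil(unfolded / height_in))
    return unfolded, n_strips


if __name__ == "__main__":
    print(strip_layout(list(range(101)), width_in=5.0, height_in=0.5, slope=1.0))
```

Doubling the slope constant doubles the unfolded height, and so roughly doubles the number of
folds for a strip of fixed height.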

In examples we have tried, we have found that choosing $s = 1$ or 2 is often reasonable (see Veitch
(1984)); the plots given here are in this range (in other circumstances we might prefer to fix the number
of folds). The choice of the quartiles (rather than, say, eighths) to determine slope might be argued;
however this also seems, from inspection of plots, to work reasonably well.

Some other applications.

Scrawl strips should not be thought of as restricted to being an appurtenance to scatter plots.
Once the tools to make them are available, they can be used anywhere a box plot or other graphical
representation or abbreviation of a distribution would be in order. We offer three examples, in the
belief that any interested user will be able to recognize and make good use of many more.

All three of these examples involve comparing two or more scrawl strips. For comparisons, it
seems desirable to blend the b-letter strip information into the scrawl strip. We shall do this by putting
light (dotted or solid) verticals (horizontals if the scrawl is associated with a vertical axis) at the
locations of "b", "e", "m", "e", and "b" (whose unit-Gaussian locations are roughly -1.5, -.85, 0, .85,
and +1.5), labelling these verticals unobtrusively, and indicating the other b-letter values by small ticks
without labels.

Finally, we could consider two approaches to the use of scrawl strips to show univariate behavior
in a "splom" or scatter-plot matrix. In the first, the scrawls are laid in diagonally in the (otherwise
empty) boxes on the main diagonal. In the second, each diagonal box shows (two or) three scrawls,
laid in horizontally, and referring to that variable and each of those shown in adjacent columns of the
matrix. (To make this plot most effective, we may wish (1) to rescale all variables to a common
(robust) scale and, possibly, (2) to order the variables in a way related to their univariate distribution.)
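
One way to carry out the rescaling in (1) is sketched below. The text does not say which robust location and scale to use; the median and the median absolute deviation are assumptions made here purely for illustration, and `robust_rescale` is a hypothetical helper name.

```python
from statistics import median


def robust_rescale(values):
    """Center on the median and divide by the median absolute deviation.

    Assumption: median/MAD stand in for the unspecified "common (robust)
    scale" mentioned in the text.
    """
    xs = [float(v) for v in values]
    center = median(xs)
    mad = median(abs(x - center) for x in xs)
    if mad == 0:
        raise ValueError("MAD is zero; choose another robust scale")
    return [(x - center) / mad for x in xs]


# Variables measured in different units land on a comparable scale.
print(robust_rescale([1, 2, 3, 4, 5]))
print(robust_rescale([17, 27, 37, 47, 57]))
```

Because the result is unchanged by any positive linear change of units, scrawls for different variables drawn after this step share a common horizontal extent.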

Warning. One potential for trouble with Q-Q plots, as with most exploratory tools, is when an appearance
is erroneously regarded as an established truth. This can arise when all three of the following hold:
(a) it is reasonable to regard the data at hand as a sample, (b) we choose to be concerned with the
population, not the sample, and (c) we have either not found or not used a significance or confidence
procedure (exact or approximate) relevant to the appearance that concerns us. (Notice that (a) fails for
the "cars" and "same house" examples above.) So far, a useful significance procedure for "wigglyness"
in a Q-Q plot seems not to be available. (We hope to return to this question elsewhere.)

Stylized  scrawl  strips. 

For purposes where details like ties and gaps are not needed (are inconsequential and/or confusing),
it may pay to stylize one's scrawl strips by (a) using smoothing to fix the corners of the scrawls,
(b) plotting each slant of the scrawl as a straight line connecting two vertices, and (c) fixing the number
of slants rather than the width of the scrawl strip.

If $\langle i \mid n \rangle$ represents the reference position for the $i$th of a sample of $n$, let $m^*$ be the least
integer as large as $1.1\sqrt{n}$, and plan to have $m^*$ slants by taking the height (before folding) of a slant
to be $D = (\langle n \mid n \rangle - \langle 1 \mid n \rangle)/m^*$. We then find the internal vertices by regressing $x_i$ on
$\langle i \mid n \rangle$ for $i$'s satisfying

$$\langle 1 \mid n \rangle + (j - \tfrac{1}{2})D \;\le\; \langle i \mid n \rangle \;\le\; \langle 1 \mid n \rangle + (j + \tfrac{1}{2})D$$

and inserting $\langle 1 \mid n \rangle + jD$ in the regression to find the $j$th vertex, $j = 1, 2, \ldots, m^* - 1$. These vertices
are then plotted alternately at the top and bottom of the scrawl, and connected by straight lines. The
points falling in the end slants are then plotted individually as before.
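
The vertex-finding step can be sketched as follows. This is only an illustration under stated assumptions: the reference positions are taken to be unit-Gaussian quantiles at (3i-1)/(3n+1), the regression is read as ordinary least squares within each window, and windows holding fewer than two points fall back to linear interpolation. The names `reference_positions` and `stylized_vertices` are hypothetical.

```python
import bisect
import math
from statistics import NormalDist


def reference_positions(n):
    """Assumed reference positions: Gaussian quantiles at (3i-1)/(3n+1)."""
    nd = NormalDist()
    return [nd.inv_cdf((3 * i - 1) / (3 * n + 1)) for i in range(1, n + 1)]


def _interp(t, xs, ys):
    """Linear interpolation of (xs, ys) at t, with xs strictly increasing."""
    k = min(max(bisect.bisect_right(xs, t), 1), len(xs) - 1)
    return ys[k - 1] + (ys[k] - ys[k - 1]) * (t - xs[k - 1]) / (xs[k] - xs[k - 1])


def stylized_vertices(x_sorted, ref=None):
    """Return the m*-1 internal vertices as (reference position, fitted x)."""
    n = len(x_sorted)
    ref = reference_positions(n) if ref is None else ref
    # Least integer as large as 1.1*sqrt(n); the tiny offset guards against
    # floating-point overshoot when 1.1*sqrt(n) is an exact integer.
    m_star = math.ceil(1.1 * math.sqrt(n) - 1e-9)
    D = (ref[-1] - ref[0]) / m_star  # height of one slant before folding
    vertices = []
    for j in range(1, m_star):
        lo = ref[0] + (j - 0.5) * D
        hi = ref[0] + (j + 0.5) * D
        target = ref[0] + j * D
        pts = [(r, x) for r, x in zip(ref, x_sorted) if lo <= r <= hi]
        if len(pts) >= 2:
            rbar = sum(r for r, _ in pts) / len(pts)
            xbar = sum(x for _, x in pts) / len(pts)
            sxx = sum((r - rbar) ** 2 for r, _ in pts)
            sxy = sum((r - rbar) * (x - xbar) for r, x in pts)
            fitted = xbar + (sxy / sxx) * (target - rbar)
        else:
            # Assumption: too few points to regress, so interpolate instead.
            fitted = _interp(target, ref, x_sorted)
        vertices.append((target, fitted))
    return vertices
```

Joining successive vertices by straight lines, after folding, gives the stylized slants; the points in the two end slants would still be plotted individually.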

Conclusions. 

Presenting marginal data distributions on the edges of a bivariate scatter plot by providing b-letter
strips (or letter strips) to mark the b-letter values (or letter values) allows considerable information
about symmetry, center and spread of these marginals to be presented in minimal space. Such letter (or
b-letter) strips may be used to advantage in other types of plots as well (e.g. in QQ plots). Additional
use of scrawl strips on the edges of a scatter plot allows other features of the marginal data distributions
to be easily seen together with the original data. Since scrawl strips are modified QQ plots they inherit
useful features of QQ plots, including information about the shape of the marginal distribution, the
presence and type of skewness, the presence of heavy tails, gaps and ties in the ordered data, appearances
of bimodality, and relation to Gaussianity. This type of information is either absent or depicted poorly
both in the original scatter plot and also in most univariate presentations such as histograms,
rootograms, boxplots, and simply "ticking" in or jittering in the marginal values.

Appendix: Extending a finite batch to an unlimited distribution.

If $y_n \ge y_{n-1} \ge \cdots \ge y_2 \ge y_1$ is an ordered batch, and there are no ties, it is convenient

(1) to assign $p_i = (3i-1)/(3n+1)$ to $y_i$, and

(2) to interpolate linearly between $(p_i, y_i)$ and $(p_{i+1}, y_{i+1})$.

It remains then to deal (a) with $p$-values outside $[p_1, p_n]$ (that is, $y$ outside $[y_1, y_n]$) and (b) with ties.

A reasonable choice is to take $m$ as the integer part of $\sqrt{n}$, and to extrapolate $(p_1, y_1)$ and
$(p_m, y_m)$ exponentially for smaller $(p, y)$. This means taking

(*)  $\ln p = A + By$

fitting $A$ and $B$ to $(p_1, y_1)$ and $(p_m, y_m)$ and then using (*) for $p < p_1$, $y < y_1$.

A similar extrapolation for $(p_n, y_n)$ and $(p_{n+1-m}, y_{n+1-m})$ applies for $p > p_n$, $y > y_n$.
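
A minimal sketch of the no-ties extension, mapping a value y to its p, is given below. The lower tail follows (*) directly. For the upper tail the text only says "a similar extrapolation"; fitting ln(1 - p) = A' + B'y is an assumption made here so that p approaches 1. The function name `extended_p` is hypothetical.

```python
import bisect
import math


def extended_p(batch, y):
    """p assigned to y by interpolation inside the batch, exponential tails outside."""
    ys = sorted(float(v) for v in batch)
    n = len(ys)
    if any(a == b for a, b in zip(ys, ys[1:])):
        raise ValueError("this sketch covers the no-ties case only")
    m = math.isqrt(n)  # integer part of sqrt(n)
    if m < 2:
        raise ValueError("need n >= 4 so that (p_1, y_1) and (p_m, y_m) differ")
    p = [(3 * i - 1) / (3 * n + 1) for i in range(1, n + 1)]
    if y < ys[0]:
        # Lower tail: ln p = A + B*y through (p_1, y_1) and (p_m, y_m).
        B = (math.log(p[m - 1]) - math.log(p[0])) / (ys[m - 1] - ys[0])
        A = math.log(p[0]) - B * ys[0]
        return math.exp(A + B * y)
    if y > ys[-1]:
        # Upper tail (assumed form): ln(1 - p) = A' + B'*y through
        # (p_n, y_n) and (p_{n+1-m}, y_{n+1-m}).
        B = (math.log(1 - p[n - m]) - math.log(1 - p[-1])) / (ys[n - m] - ys[-1])
        A = math.log(1 - p[-1]) - B * ys[-1]
        return 1.0 - math.exp(A + B * y)
    # Inside the batch: linear interpolation between neighbouring (p_i, y_i).
    k = min(max(bisect.bisect_right(ys, y), 1), n - 1)
    frac = (y - ys[k - 1]) / (ys[k] - ys[k - 1])
    return p[k - 1] + frac * (p[k] - p[k - 1])
```

For a batch of the integers 1 to 16, `extended_p(range(1, 17), 8.5)` falls midway between $p_8$ and $p_9$, while values below 1 or above 16 receive p strictly between 0 and 1.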

For ties, suppose that $y_j = y_{j+1} = \cdots = y_k$ is the full extent of one tie. Let $(p_M, y_M)$ correspond to
$(j+k)/2$, the median of the tie. Put

$$y_L = (y_{j-1} + y_j)/2, \qquad y_R = (y_k + y_{k+1})/2,$$

$$p_L = (p_{j-1} + p_j)/2, \qquad p_R = (p_k + p_{k+1})/2$$

and plan to do linear interpolation between $(p_L, y_L)$ and $(p_M, y_M)$ and also between $(p_M, y_M)$ and
$(p_R, y_R)$. (If the tying is due to a regular pattern of possible values we should replace $y_{j-1}$ by the next
lower possible value and $y_{k+1}$ by the next higher possible value.)

If there are no ties, $(p_M, y_M) = (p_j, y_j)$ and $(p_L, y_L)$ is the midpoint of the segment joining
$(p_{j-1}, y_{j-1})$ to $(p_j, y_j)$, so that interpolation between $(p_L, y_L)$ and $(p_M, y_M)$ is, in this special case, the
same as that between $(p_{j-1}, y_{j-1})$ and $(p_j, y_j)$. Thus the procedure for ties reduces gracefully to the
procedure for no ties and can, if it is more convenient, be always applied.
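
The tie procedure can be sketched as a list of interpolation knots. The names `tie_aware_knots` and `quantile` are hypothetical. Two choices here are assumptions of the sketch: no left knot is produced for the lowest run (it has no lower neighbour), and the right knot of each run is omitted because it coincides with the left knot of the next run. The replacement by "next possible values" for regularly spaced data is not attempted.

```python
import bisect


def tie_aware_knots(batch):
    """Knots (p, y) for piecewise-linear interpolation of a batch with ties."""
    ys = sorted(float(v) for v in batch)
    n = len(ys)

    def p_at(rank):
        # Plotting position (3i-1)/(3n+1), also used at half-integer ranks.
        return (3.0 * rank - 1.0) / (3.0 * n + 1.0)

    knots = []
    j = 1  # 1-based rank at which the current run of equal values starts
    while j <= n:
        k = j
        while k < n and ys[k] == ys[j - 1]:  # ys[k] holds y_{k+1}
            k += 1
        if j > 1:
            # Left knot: midpoint between the run and its lower neighbour.
            knots.append((0.5 * (p_at(j - 1) + p_at(j)),
                          0.5 * (ys[j - 2] + ys[j - 1])))
        # Median knot of the run, at rank (j + k) / 2.
        knots.append((p_at(0.5 * (j + k)), ys[j - 1]))
        j = k + 1
    return knots


def quantile(knots, p):
    """Linear interpolation between knots; p is assumed to lie within them."""
    ps = [a for a, _ in knots]
    k = min(max(bisect.bisect_right(ps, p), 1), len(knots) - 1)
    (p0, y0), (p1, y1) = knots[k - 1], knots[k]
    return y0 + (y1 - y0) * (p - p0) / (p1 - p0)
```

With no ties every left knot lies on the segment joining neighbouring $(p_i, y_i)$, so `quantile` agrees with plain linear interpolation, which is the graceful reduction noted above.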

For the small-group large-group case, where $(q_h, z_h)$ may represent the smaller group's observations with

$$q_h = \frac{3h-1}{3n_s+1}$$

but

$$p_i = \frac{3i-1}{3n_l+1}$$

where $n_s < n_l$, we find a standard value for $z_h$ by equating $q_h$ and $p_i$, giving

$$i = \tfrac{1}{3} + \tfrac{1}{3}(3n_l+1)\,q_h = \tfrac{1}{3} + \tfrac{1}{3}\cdot\frac{(3n_l+1)(3h-1)}{3n_s+1} = \tfrac{1}{3} + \frac{3n_l+1}{3n_s+1}\left(h - \tfrac{1}{3}\right)$$

and then interpolating for $i$. (We may have to face ties; we will not have to extrapolate.)

In our example, California cities with ranks (from above) 22 to 25 had 51.5% "stayers". Further
$n_s = 73$, $n_l = 475$. The median rank of 23.5 converts to $i = 151.16$.

In the non-California distribution 151 falls on a 53.7% tie, while 152 falls on a 53.6% tie. Thus (151.5,
53.65%) and (150, 53.70%) are to be interpolated to 151.16, giving 53.66% as the standard value to be
used in the single-scrawl Figure 17.

NOTE: Letters used with years on John Tukey's publications correspond to bibliographies in all volumes of his
collected papers.
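
The rank conversion and the final interpolation of the small-group large-group example can be checked with a short sketch. Equating (3h-1)/(3·73+1) with (3i-1)/(3·475+1) and solving for i gives a leading constant of 1/3, which evaluates to roughly 150.50 at h = 23.5; a leading constant of 1 reproduces the quoted 151.16. Since it is not clear from the text which was intended, the sketch exposes that constant as a hypothetical `offset` parameter rather than guessing. The function names are likewise hypothetical.

```python
def matched_rank(h, n_small, n_large, offset=1 / 3):
    """Rank i in the larger group whose p_i equals q_h of the smaller group.

    offset=1/3 is what equating q_h and p_i yields; other values are only
    for comparison with the figure quoted in the text.
    """
    ratio = (3 * n_large + 1) / (3 * n_small + 1)
    return offset + ratio * (h - 1 / 3)


def interpolate(i, left, right):
    """Linear interpolation at rank i between two (rank, value) points."""
    (i0, v0), (i1, v1) = left, right
    return v0 + (v1 - v0) * (i - i0) / (i1 - i0)


print(matched_rank(23.5, 73, 475))              # leading constant 1/3
print(matched_rank(23.5, 73, 475, offset=1.0))  # leading constant 1
print(interpolate(151.16, (150, 53.70), (151.5, 53.65)))
```

The last line interpolates between the two (rank, percent) points named in the example and lands on the 53.66% used as the standard value.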


Figure 1

QQ plot for weights of 74 autos
in the 1979 model year

Figure 2

Histogram of weights of 74 autos
in the 1979 model year

Figure 3

QQ plot after the first fold

Figure 4

Statistics for 74 auto models of 1979

Figure 5

Statistics for 74 auto models of 1979

Figure 6

QQ plot for weights of 74 autos
in the 1979 model year

Figure 7

Fisher's Iris data
(virginica and versicolor)

Figure 8

Fisher's Iris data (jittered)

Figure 9

Transformed Iris data
(virginica and versicolor)

Figure 10

Transformed Iris data
(Stars plot virginica, diamonds plot versicolor)

Figure 11

Iris data (all varieties)

Figure 12

Transformed Iris data.
Stars = virginica, diamonds = versicolor, circles = setosa